Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation
We study reinforcement learning with _multinomial logistic_ (MNL) function approximation, where the underlying transition probability kernel of the _Markov decision process_ (MDP) is parametrized by an unknown transition core with features of state and action. For the finite-horizon episodic setting with inhomogeneous state transitions, we propose provably efficient algorithms with randomized exploration having frequentist regret guarantees. Here, $d$ is the dimension of the transition core, $H$ is the horizon length, $T$ is the total number of steps, and $\kappa$ is a problem-dependent constant. Despite the simplicity and practicality of \texttt{RRL-MNL}, its regret bound scales with $\kappa^{-1}$, which is potentially large in the worst case. To improve the dependence on $\kappa^{-1}$, we propose \texttt{ORRL-MNL}, which estimates the value function using local gradient information of the MNL transition model.
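To make the setting concrete, the following is a minimal sketch (not the paper's algorithm) of an MNL transition model and a Gaussian-perturbation step of the kind used by randomized-exploration methods. All function names are illustrative; `phi` stands for the state-action-next-state features and `theta` for the transition core from the abstract.

```python
import numpy as np

def mnl_transition_probs(theta, phi):
    """MNL transition probabilities over K candidate next states.

    theta : (d,) transition-core parameter (unknown to the learner)
    phi   : (K, d) feature vectors phi(s, a, s') for the candidate next states

    p(s' | s, a) = exp(phi(s,a,s') @ theta) / sum_j exp(phi(s,a,s_j) @ theta)
    """
    logits = phi @ theta
    logits -= logits.max()          # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def perturbed_core(theta_hat, cov, scale, rng):
    """One randomized-exploration step: sample a perturbed transition core
    around the current estimate, theta ~ N(theta_hat, scale^2 * cov).
    Planning greedily with the perturbed core induces exploration."""
    d = theta_hat.shape[0]
    noise = rng.multivariate_normal(np.zeros(d), cov)
    return theta_hat + scale * noise
```

The probabilities are a softmax of linear scores, which is what makes the model "multinomial logistic"; the perturbation scale and covariance would be set by the algorithm's theory, which is elided here.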
Improved Regret of Linear Ensemble Sampling
In this work, we close the fundamental gap of theory and practice by providing an improved regret bound for linear ensemble sampling. We prove that with an ensemble size logarithmic in $T$, linear ensemble sampling can achieve a frequentist regret bound of $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$, matching state-of-the-art results for randomized linear bandit algorithms, where $d$ and $T$ are the dimension of the parameter and the time horizon respectively. Our approach introduces a general regret analysis framework for linear bandit algorithms. Additionally, we reveal a significant relationship between linear ensemble sampling and Linear Perturbed-History Exploration (LinPHE), showing that LinPHE is a special case of linear ensemble sampling when the ensemble size equals $T$. This insight allows us to derive a new regret bound of $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ for LinPHE, independent of the number of arms. Our contributions advance the theoretical foundation of ensemble sampling, bringing its regret bounds in line with the best known bounds for other randomized exploration algorithms.
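A schematic implementation of linear ensemble sampling may help fix ideas. This is a generic sketch under common conventions (shared Gram matrix, independently perturbed reward targets per ensemble member), not the paper's exact construction; the class name and perturbation scheme are assumptions for illustration.

```python
import numpy as np

class LinearEnsembleSampling:
    """Minimal linear ensemble sampling sketch.

    Maintains m perturbed regularized least-squares estimates; each round
    one member is drawn uniformly at random and the arm maximizing its
    estimated reward is played.
    """
    def __init__(self, d, m, lam=1.0, noise_sd=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.V = lam * np.eye(d)          # shared regularized Gram matrix
        self.b = np.zeros((m, d))         # one perturbed target vector per member
        self.m = m
        self.noise_sd = noise_sd

    def select(self, arms):
        """arms: (K, d) feature matrix; returns the index of the chosen arm."""
        j = self.rng.integers(self.m)               # random ensemble member
        theta_j = np.linalg.solve(self.V, self.b[j])
        return int(np.argmax(arms @ theta_j))

    def update(self, x, r):
        """Rank-one Gram update; each member sees an independently
        perturbed copy of the observed reward."""
        self.V += np.outer(x, x)
        for j in range(self.m):
            self.b[j] += (r + self.noise_sd * self.rng.standard_normal()) * x
```

The connection to LinPHE stated in the abstract can be read off this template: with ensemble size $m = T$, each member's target is the full history with fresh perturbations, which is exactly a perturbed-history estimate.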
Geometry-Aware Approaches for Balancing Performance and Theoretical Guarantees in Linear Bandits
This paper is motivated by recent research in the $d$-dimensional stochastic linear bandit literature, which has revealed an unsettling discrepancy: algorithms like Thompson sampling and Greedy demonstrate promising empirical performance, yet this contrasts with their pessimistic theoretical regret bounds. The challenge arises from the fact that while these algorithms may perform poorly in certain problem instances, they generally excel in typical instances. To address this, we propose a new data-driven technique that tracks the geometric properties of the uncertainty ellipsoid around the main problem parameter. This methodology enables us to formulate an instance-dependent frequentist regret bound, which incorporates the geometric information, for a broad class of base algorithms, including Greedy, OFUL, and Thompson sampling. This result allows us to identify and ``course-correct" problem instances in which the base algorithms perform poorly. The course-corrected algorithms achieve the minimax optimal regret of order $\tilde{\mathcal{O}}(d\sqrt{T})$ for a $T$-period decision-making scenario, effectively maintaining the desirable attributes of the base algorithms, including their empirical efficacy. We present simulation results to validate our findings using synthetic and real data.
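The geometric quantity being tracked can be sketched in a few lines. Assuming a standard confidence ellipsoid of the form $\{\theta : (\theta - \hat\theta)^\top V (\theta - \hat\theta) \le \beta^2\}$ around the estimate, its shape is determined by the eigenvalues of the Gram matrix $V$; the diagnostic and threshold rule below are illustrative, not the paper's exact course-correction criterion.

```python
import numpy as np

def ellipsoid_axes(V, beta):
    """Semi-axis lengths of the confidence ellipsoid
    {theta : (theta - theta_hat)^T V (theta - theta_hat) <= beta^2}.

    Along the eigenvector of V with eigenvalue lambda_i, the semi-axis
    has length beta / sqrt(lambda_i): a small eigenvalue of V marks a
    direction of parameter space that remains poorly explored.
    """
    eigvals = np.linalg.eigvalsh(V)
    return beta / np.sqrt(eigvals)

def needs_course_correction(V, beta, threshold):
    """Flag an instance whose longest ellipsoid axis exceeds `threshold`,
    suggesting a forced-exploration step for a greedy-style base algorithm."""
    return bool(ellipsoid_axes(V, beta).max() > threshold)
```

The point of such a check is instance-dependence: on typical instances greedy-style play keeps all eigenvalues growing and no correction fires, while on hard instances the elongated ellipsoid is detected and exploration is injected.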